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Abstract: Autoimmune diseases are often intractable because their causes are unknown. Identifying which genes con- 
tribute to these diseases may allow us to understand the pathogenesis, but it is difficult to determine which genes contrib- 
ute to disease. Recently, epigenetic information has been considered to activate/deactivate disease-related genes. Thus, it 
may also be useful to study epigenetic information that differs between healthy controls and patients with autoimmune 
disease. Among several types of epigenetic information, promoter methylation is believed to be one of the most important 
factors. Here, we propose that principal component analysis is useful to identify specific gene promoters that are differ- 
ently methylated between the normal healthy controls and patients with autoimmune disease. Full Automatic Modeling 
System (FAMS) was used to predict the three-dimensional structures of selected proteins and successfully inferred rela- 
tively confident structures. Several possibilities of the application to the drug discovery based on obtained structures are 
discussed. 
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INTRODUCTION 

Autoimmune disease is defined as "a clinical syndrome 
caused by the activation of T cells or B cells, or both, in the 
absence of an ongoing infection or other discernible cause" 
[1]. As this robust definition suggests, the causes of autoim- 
mune diseases are not generally known. However, the ge- 
netic background of patients is generally believed to be im- 
portant [2]. 

Furthermore, since the concordance rate between 
monozygotic (MZ) twins is not always high [3], some ge- 
netic mechanisms other than simple coincidence between 
DNA sequences have been sought. Recently, epigenetics was 
regarded to be a key factor in understanding the regulation of 
autoimmune diseases [3, 4]. Among these mechanisms, pro- 
moter methylation is regarded as the most critical factor [5]. 
Although studies have suggested that common genes under- 
lie the pathogenesis of multiple autoimmune disorders [2], 
there have been no reports of common methylation patterns 
in more than one autoimmune disease. 

Recently Javierre et al. [6] investigated white blood cells 
extracted from the patients of three autoimmune diseases, 
systemic lupus erythematosus (SLE), rheumatoid arthritis 
(RA) and dermatomyositis (DM). However, disease-specific 
promoter methylation was only identified in SLE [6]. Thus, 
it appears there are no common-methylated promoters be- 
tween these three autoinmiune diseases. In this paper, we 
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reanalyzed Javierre et al. [6]'s methylation data of autoim- 
mune diseases and found disease-specific promoter methyla- 
tion in RA and DM. We also identified a common methyla- 
tion pattern in the three autoimmune diseases. The selected 
genes were evaluated based on their protein structure predic- 
tion by FAMS [7, 8]. The selected genes were also compared 
with previous studies regarding the individual gene. This 
study investigated the relationship between the selected 
genes and the proteins commonly expressed in more than 
one autoimmune disease [9]. The results presented here 
strongly suggest the validity of our findings. The possibility 
of using this application for drug discovery based upon 
FAMS will also be discussed. 

MATERIALS AND METHODS 

Promoter Methylation Profile and Principal Component 
Analysis (PCA) Based Selection of Genes 

The supplementary file Supp_Data_2.xls, which can be 
downloaded from the journal website [6], was used for 
promoter methylation profile. A tab-limited csv file was 
generated and was loaded to R [10] by read.csv function. All 
59 samples were employed for further analysis (Table 1). 

PCA is a kind of ordination method. In ordination 
method, objects placed in the high dimensional space is 
aligned onto the newly generated lower dimensional space. 
In PCA, projection to the lower dimensional space is per- 
formed by simple linear transformation. PCA was applied to 
promoter methylation profiles. Principal component (PC) 
scores were attributed to probes that expressed the gene 
promoters. PCA was also applied to promoter methylation 
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Table 1. Information of samples used. The numbers in parentheses indicate age. M and F in the column identified as gender are 
male and female, respectively. Samples annotated as SLE, RA and DM are twins with disease. Those identified as Healthy 
correspond to healthy twins. Control 1 and Control 2 are age and gender matched controls. 



ID 


gender 










SLE 


34 


F 


SLE (34) 


Healthy (34) 


ControU (33) 


Control2 (36) 


20.1 


F 


SLE (20) 


Healthy (20) 


Controll (23) 


Control2 (22) 


11 


F 


SLE (12) 


Healthy (12) 


Control 1 (10) 




29 


F 


SLE (29) 


Healthy (29) 


Controll (24) 


Control2 (31) 


20.2 


F 


SLE (20) 


Healthy (20) 


Controll (22) 


Control2 (22) 


RA 


4 


F 


RA(43) 


Healthy (43) 


Controll (41) 


Control2 (48) 


12 


M 


RA(9) 


Healthy (9) 


Controll (9) 


Control2 (7) 


80 


M 


RA(12) 


Healthy (12) 


Controll (14) 


Control2 (12) 


85 


F 


RA(18) 


Healthy (18) 


Controll (21) 


Control2 (24) 


86 


F 


RA(6) 


Healthy (6) 


Controll (6) 


Control2(10) 


DM 


16 


M 


DM (34) 


Healthy (34) 


Controll (33) 


Control2 (34) 


33 


M 


DM (13) 


Healthy (13) 


Controll (11) 


Control2(17) 


103 


F 


DM (3) 


Healthy (3) 


Controll (9) 


Control2 (9) 


127 


F 


DM (11) 


Healthy (11) 


Controll (8) 


Control2(12) 


129 


F 


DM (5) 


Healthy (5) 


Controll (9) 


Control2 (8) 



profiles subdivided into sample sets, which corresponded 
with each disease, SLE, RA and DM. 

Probes (promoters of genes) that were outliners along 
either the second or the third PC scores were selected as be- 
ing significantly and differently methylated between the 
normal control and disease groups. The selection of PCs de- 
pended upon which axis represented the difference between 
disease and normal controls. Further details about thresholds 
and criteria for the selection can be found in Supporting In- 
formation. 

/ Test for Different Methylation of Probes 

t test was used to determine whether a diseased or healthy 
twin had differently methylated promoters for the selected 
probes. The algorithm used was as follows: is the number 
of probes and xiji^ is the amount of methylation of the ith 
promoter for the jth individual who belongs to the kth sub- 
group (k corresponds to disease, either SLE, RA, or DM). 
Here, j is either 1 (disease twin) or 0 (healthy twin). Then we 
defined the amount of differential methylation Axik between 
the diseased twin and the healthy twin as 

^ik ~ ^ilk ~ ^iOk ■ 

When we selected probes (' e S' as being differently 
methylated and ('" e S" as not, P values were determined by 



the comparison between the probes in S' and the probes in 
S" by f test. 

Protein Structure Prediction and Protein Complex Infer- 
ence Based on FAMS 

Selected genes' amino acid sequences were downloaded 
fi'om SWISS Prot. Then their three-dimensional protein 
structures were inferred by FAMS [7, 8]. 

Whether a pair of model proteins used for structure pre- 
diction could form protein complexes or not was determined 
as follows. First, Protein Data Bank (PDB) files that con- 
tained at least one model protein as a member of a protein 
complex were downloaded. Then, the model proteins that 
were included in the common PDB files were investigated. 
Next, inter-atomic distances between pairs of model proteins 
that belonged to the same PDB file were computed. If there 
was at least one pair of atoms whose distances were less than 
3.5 A, the pair of model proteins was listed as a candidate to 
form a protein complex. 

"in silico" Ligand Screening 

First, amino acid sequences registered in PDB with more 
than 95 % sequence similarity with a reference protein were 
listed. Among those PDB structures, those with more than 30 
heavy atoms and without any atoms within Van der Waals 
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radius were selected. Then we analyzed PDB files that had at 
least one ligand binding. After removing ligands from these 
selected PDB files, chooseLD [11] was computed with alter- 
native ligand candidates. Then, FingerPrint Alignment scores 
(FPAscores) were used to measure ligand candidates. 

Validation of "in silico" Ligand Screening with 
ChEMBL Database 

To validate the results of "in silico" ligand screening, we 
used an assay where ligand binding to the reference protein 
was investigated in ChEMBL [12]. Then K, values of ligands 
were obtained from ChEMBL [12]. ChEMBL is an Open 
Data database containing binding, functional and "Absorp- 
tion, Distribution, Metabolism, Excretion, Toxicity" (AD- 
MET) information for a large number of drug-like bioactive 
compounds. Ki is the inhibition constant and represents the 
ability to reduce the function of a protein. The correlation 
coefficients between /T/s and FPAScore were used as a meas- 
ure to represent the accuracy of "m silico" ligand screening. 

RESULTS 

PCA Application for Whole Samples 

First, PCA was applied to all 59 samples (Fig. SI). It 
produced a one-dimensional structure and a barb-like struc- 
ture was observed that extended to the right end of the one- 
dimensional structure towards the middle-bottom. To under- 
stand what this barb-like structure means, we drew the con- 
tribution of each sample to PCs (Fig. S2). In contrast to the 
first PC, which had almost constant contributions from each 
sample (Fig. S2(a)), the second PC had contributions with 
opposite signs, dependent upon the gender of each sample 
(Fig. S2(b)). Thus, it is probable that the barb-like structure 
expresses differences between genders. This is likely, since 
genes located on the X chromosome are concentrated in this 
barb-like structure (red dots in Fig. Sl(a)). Once genes lo- 
cated on the X chromosome are removed, the barb-like struc- 
ture disappears (Fig. Sl(b)). 

PCA Application to SLE Samples 

To select genes differently expressed between normal 
controls and SLE patients, we applied PCA to SLE samples 
(IDs 34, 20.1, 11, 29 and 20.2 in Table 1). Figure S3(b) 
shows two-dimensional embeddings of probes spanning the 
second and third PC scores. Since the first PC did not repre- 
sent a difference among probes but had an almost constant 
value for all probes, we hereafter excluded the first PC 
scores from further analysis. Although it is not as clear, a 
bump-like structure expands from the origin toward the 
negative direction of the second PC scores. To understand 
the biological meaning of this bump-like structure, we drew 
the contribution of each sample to the second PC (Fig. S4). 
SLE samples were subdivided into five subgroups, each of 
which consisted of twins and age/gender-matched controls. 
Disease twins (O) always had the largest second PC scores 
within each subgroup (Fig. S4(a)). Thus, the second PC 
scores were expected to express any differences in gene 
promoter methylation between diseased twins and controls. 

To understand the differences, we selected 58 probes 
located in the outer region along the second PC (red dots in 



Fig. S3), and apphed the t test to determine whether methyla- 
tion was distinct between probes in the outer regions and 
others (Fig. S5). For all five subgroups, the selected probes 
(red) were significantly demethylated in diseases from other 
probes (black). This supports the general beUef that genes 
that cause autoimmune diseases should be overexpressed, 
because methylation is believed to suppress gene expression. 
Thus, PCA-based probe selection correctly selected the dif- 
ferently expressed probes. 

PCA Application to RA Samples 

We repeated the same procedure for 20 RA samples (IDs 
4, 12, 80, 85 and 86 in Table 1). The results are shown in 
(Figs. S6, S7 and S8). Although it is less obvious than for the 
SLE samples, the 53 selected probes form a bump-like struc- 
ture along the second PC score axis (Fig. S6(b)). These 
probes were significantly demethylated in the diseased twins 
(O) compared with the healthy twins (A) as expected (Fig. 
S8), excluding the subgroup with ID: 85. The striking differ- 
ence between the SLE samples and the RA samples is that 
the RA diseased twins did not have distinct second PC scores 
compared with the age/gender matched controls (+ and x. 
Fig. S7), while the SLE diseased twins did. This suggests 
that RA diseased twins do not have genes that are methylated 
differently from normal controls and that the difference be- 
tween RA diseased twins and the group consisting of healthy 
twins and normal controls was not significant. This may be 
the reason why Javierre et al. [6] did not find any probes 
expressed differently or significantly between controls and 
RA patients. Again, the PCA-based probe selection success- 
fully selected differently expressed probes. 

PCA Application to DM Samples 

We repeated the same procedure using 20 DM samples 
(IDs 16, 33, 103, 127 and 129 in Table 1). The results are 
shown in (Figs. S9, SIO, Sll and S12). Compared with the 
former two cases using SLE and RA samples, the DM sam- 
ple was harder to analyze. First, as can be seen in (Fig. Sll), 
the second PC did not express differences between the con- 
trol and disease groups but rather between genders. This 
suggested that the difference between disease and control 
groups was not even of secondary importance. The differ- 
ence between disease and control groups was only observed 
in the third PC (Fig. Sll(b)). Second, along the third PC 
score, the bump-like structure was less clear, even compared 
with the RA samples (Fig. S9(b)). This prevented us from 
selecting a clear threshold for outliers. Thus, we tentatively 
selected 44 probes (red dots in Fig. S9) with negative third 
PC scores. However, close inspection of the distribution of 
the third PC scores highlighted a slight hump in the distribu- 
tion of the third PC scores (red regions and a red arrow in 
Fig. SIO). Thus, we selected probes that belonged to this 
small hump as outliers, and as the probes that were differ- 
ently expressed between disease and control groups. Third, 
the observed differences of the third PC scores between dis- 
ease and control groups were dependent upon gender (Fig. 
Sll(b)). Although female twins (red) had larger third PC 
scores for DM twins (O) compared with healthy twins (A), 
male twins (black) had smaller PC scores for DM twins (O) 
compared with healthy twins (A). This gender-dependence 
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makes it difficult to identify significant differences between 
disease and control groups, because samples with equal 
number of males and females can have lower third PC 
scores. This possibly also reduced the amount of methyla- 
tion/demethylation differences between control and disease 
groups. Differential methylation is reversed between males 
(ID: 16 and 33 in Fig. S12) and females (ID: 103, 127, and 
129 in Fig. S12). This may explain why Javierre et al. [6] did 
not find probes expressed differently and significantly be- 
tween control and DM groups. Thus, the PCA-based method 
can select promoters that are methylated differently between 
normal twins and diseased twins from DM samples. 

Comparisons with the Previous Feature Selection 
Methods and Comparisons with other Data Sets 

Comparisons with the previous feature selection methods 
and comparisons with other (independent or validation) data 
set were not discussed here but were better to be discussed. 
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However, since it is lengthy and is not directly related to the 
outcome obtained in the present study, we moved these to 
supporting information. As a result, we confirmed that our 
methods outperformed previously proposed methods [13, 14] 
(no previously proposed methods tested provided us 
commonly selected genes for three diseases), and our 
methods can work even if it is applied to other (only SLE 
[15] and RA [16], no DM) samples. 

Protein Structures Predicted by FAMS 

To understand the functions and structures of the selected 
genes, we applied FAMS to the genes selected by PCA. In 
Table 2, we listed the genes selected in the present study. 
These genes were used as reference proteins for FAMS. To- 
gether with the reference proteins, we listed the model pro- 
teins that were inferred to have similar structures to each of 
the selected genes by FAMS. 



Table 2. Selected genes and model proteins used for structure prediction. Bold ID of PDB indicates that the reference protein itself was 
detected in PDB. 



Reference gene 
symbol 


Model 
PDB ID 


p-value 


gene symbol 


AIM2 


3RN5_A 


4 X 10 '" 


INTERFERON -INDUCIBLE PROTEIN AIM2 


CARD15 


3CIY_B 


1 X 10"'* 


TOLL-LIKE RECEPTOR 4, VARIABLE LYMPHOCYTE (TLR4) 


CD82 


2BG9 A 


0.48 


ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN 


CSFIR 


3B43 A 


5 X 10"'* 


TITTN 


CSF3 


1GNC_A 


8 X 10 " 


GRANULOCYTE COLONY-STIMULATING FACTOR 


CSF3R 


3DMK_A 


4 X 10 '" 


DOWN SYNDROME CELL ADHESION MOLECULE (DSCAM) 


DHCR24 


2Q4W_A 


1 X 10"" 


CYTOKININ DEHYDROGENASE 7 (CK07) 


ERCC3 


2W74_D 


1 X 10 


TYPE I RESTRICTION ENZYME ECOR124n R PROTEIN (HSDR) 


GRB7 


3HK0_B 


3 X 10 " 


GROWTH FACTOR RECEPTOR-BOUND PROTEIN 10 (GRBIO) 


HGF 


4DUU_A 


1 X 10™ 


PLASMINOGEN 


H0XB2 


2D5V_A 


1 X 10 " 


HEPATOCYTENUCLEAR FACTOR 6 (HNF-6) 


IFNGR2 


3T1W_A 


4 X 10 "' 


FOUR-DOMAIN FIBRONECTIN FRAGMENT 


LCN2 


1X71_A 


4 X 10 " 


NEUTROPHIL GELATINASE-ASSOCIATED LIPOCALIN (NGAL) 


LM02 


2XJY_A 


2 X 10 " 


RHOMBOTIN-2 


LTB4R 


2KS9_A 


2 X 10"" 


SUBSTANCE-P RECEPTOR 


MMP14 


1SU3_B 


1 X 10"'" 


INTERSTITIAL COLLAGENASE (MMP-1) 


MMP8 


1SU3_B 


1 X 10""' 


INTERSTITIAL COLLAGENASE (MMP-1) 


MPL 


3L5H_A 


8 X 10"" 


INTERLEUKIN-6 RECEPTOR SUBUNIT BETA (IL6RB) 


PADI4 


2DEW_X 


0.0 


PROTEIN- ARGININEDEIMINASE TYPE IV 


PhCAMI 


3L)MK_A 


1 X 10 


DOWN SYNDROVlb CELL ADHESION MOLHCULh (DSCAM) 


PI3 


1TWP_A 


2 X 10" 


WHEY ACIDIC PROTEIN (WAP) 


RARA 


3DZY_A 


4 X 10"" 


RETINOIC ACID RECEPTOR RXR-ALPHA 
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Reference gene 
symbol 


Model 
PDBID 


P-value 


gene symbol 


S100A2 


2RGI_A 


4 X 10" 


PROTEIN S100-A2 


SEPT9 


3FTQ_A 


1 X 10 


SEPTIN-2 


SLC22A18 


1PW4_A 


1 X 10'°* 


GLYCEROL-3-PHOSPHArE TRANSPORTER 


SPIl 


1GVJ_B 


1 X 10 " 


C-ETS-1 PROTEIN (ETSl) 


SPPl 


1D2T_A 


1 X 10 


ACID PHOSPHATASE (ACP) 


STAT5A 


1Y1U_A 


0.0 


SIGNAL TRANSDUCER AND ACTIVATOR OF TRANSCRIPTION (STATS A) 


SYK 


20Z0_A 


1 X 10 '™ 


TYROSINE-PROTEIN KINASE ZAP-70 


ribi 


3L)V1R_A 


2 X 10 


DOWN SYNDROMb CbLL ADllbSION MObbCUbb (L)SCAM) 


TM7SF3 


1AR1_A 


6 X 10 


CYTOCHROME C OXIDASE 


TRIP6 


1B8T_A 


2 X 10 " 


CYSTEINE-RICH PROTEIN 1 (CRPl) 


VAMP8 


2K0G_A 


1 X 10 " 


VESICLE-ASSOCIATED MEMBRANE PROTEIN 2 (VAMP2) 



First, FAMS successfully listed the model proteins with 
small P- values for most of the reference proteins. Fig. SI 3 
shows a representative example of the combination of a 
model protein and a reference protein: the model protein 
2OQ0_B and the reference protein AIM2. Regions of align- 
ment included a 1 92 amino acid sequence (total length = 209 
amino acids) of 2OQ0_B and 191 amino acid sequence (total 
length = 343 amino acids) of AIM2. Sequence similarity 
between the two alignment regions was 44 %. P- value at- 
tributed was 2 X 10"'°. 2OQ0_B is annotated as IFI-16. Al- 
though AIM2 itself is included in the present PDB, the struc- 
ture predicted by FAMS using IFI-16 as a model protein was 
very similar to the true structure, although sequence similar- 
ity between AIM2 and IFI-16 is less than 50 %. This demon- 
strated that structure prediction using FAMS is very accu- 
rate. Although this is only one example of a typical relation- 
ship between model/reference proteins, it was generally rep- 
resentative of the quality of structural similarities we ob- 
served for other proteins. This suggests that the structural 
homology between models and references is reliable. 

DISCUSSION 

First, we will discuss the analysis of our results to deter- 
mine whether our selection process was correct. Then, the 
possibilities of the application to drug discovery will be dis- 
cussed. 

Comparison with Previous Studies and between Dis- 
eases 

We compared our results with Javierre et al. [6] (Table 
3). It is clear that there are substantial overlaps between our 
studies and Javierre et al. [6]. If one considers that different 
methods and/or samples were employed, this coincidence is 
more than remarkable. It may suggest that our analysis was 
correct. Figure 1 shows the overlap between the distinct dis- 
eases. Again, the coincidence is very high. Thus, we can 



conclude that for the first time we could successfully identify 
commonly methylated promoters for more than one autoim- 
mime disease. 




DM 135(p 

Figure 1. Venn diagram of selected probes between diseases. 

Biological Relevance of Selected Proteins 

In this subsection, we determine whether the selection of 
proteins was coincident with what was previously known, 
which is schematically summarized in (Fig. 2). 

Table 3. A comparison between previous studies (*) [6] and the 
present work. Numbers indicate the number of over- 
laps between previous studies and the present work. 
Numbers in parentheses represent the total number of 
probes that were regarded as being expressed differ- 
ently between disease and control groups in the pre- 
sent work. 



SLE(*) 


SLE 


RA 


DM 


54 


51(58) 


48(53) 


37 (44) 



Recently, O'Hanlon et al. [9] investigated plasma pro- 
teomic profiles from disease-discordant MZ twins (four pairs 
discordant for SLE, four pairs discordant for juvenile idio- 
pathic arthritis (juvenile RA) and two pairs discordant for 
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SNARE complex 
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'STX17)(^AMP8) 



Figure 2. Suggested relation between proteins selected in the present study and those reported in previous studies. Grey ovals: present study, 
black ovals: O'Hanlon etal. [9] and a broken oval: O'Hanlon et al. [29]. For more details, see text. 



juvenile DM), and found that several proteins were differ- 
ently expressed between patients and normal controls. 
Among several proteins found, O'Hanlon et al. [9] identified 
11 genes (STX17, MGAM, PONl, C6, SYNEl, PLEKHG5, 
ZNA2GP, LRGl, PKDl and AP0A2) as important. Many 
clinical studies suggested that these proteins (genes) are re- 
lated to the genes selected by the present study. For example, 
GCSF was reported to upregulate LRGl [17], while LRGl 
was upregulated in patients as shown by O'Hanlon et al. [9]. 
LRGl and GCSF were also upregulated in chemia- 
conditioned mice [18]. Ai ef al. [19] also discussed the tight 
relationship between G-CSF and LRGl, which is located 
downstream of PU.l signaling pathways. PU.l is targeted by 
RARA [20], one of the promoters we identified as being 
differently methylated in patients compared with normal 
controls. (Table 2, see also Kyoto Encyclopedia of Genes 
and Genomes (KEGG) pathway hsa05200). 

LGRl is upregulated during neutrophil activation. Neu- 
trophil granulocytes are the most abundant type of white 
blood cells in mammals and form an essential part of the 
innate immune system. It is notable that LCN2 (NGAL : 
neutrophil gelatinase-associated lipocalin) was also selected 
by our study. 

Thus, the selection of CSFIR, CSF3, and CSF3R as criti- 
cal genes by our research is not accidental but biologically 
meaningful. 

O'Hanlon et al. [9] also emphasized the importance of 
PONl, a gene that causes IFIl 6-mediated expression 
changes in endothelial cells [21]. IFIl 6 was reported to have 
a similar structure to AIM2 in our research (see Fig. S13). 

Fischer et al. [22] confirmed that C6 is a toll-like recep- 
tor (TLR) 4 agonist. C6 is expressed in autoimmune disease 
patients and is involved in the key pathway related to auto- 
immune disease [9], while TLR4 was found to have a similar 
structure to CARD 15 by our study. 



Syntaxin 17 (STX17) protein was highlighted by 
O'Hanlon et al. [9]. Syntaxin forms a SNARE complex with 
VAMP [23], and VAMPS was recognized by our analysis. 
VAMPS and STX17 signal in the KEGG pathway "SNARE 
interactions in vesicular transport" (hsa04130). SNARE 
complexes including VAMPS were reported to be related to 
the autoimmune disease, Sjogren's syndrome [24], and was 
proposed to be a novel target for the development of new 
therapies to treat allergic and autoimmune disease [25]. 
Gordon et al. [26] also studied SNARE complex structures 
including STX17 and VAMPS using a newly developed as- 
say. They experimentally confirmed the various functions of 
SNARE complexes although they could not identify the spe- 
cific function of either STX17 or VAMPS for cell secretion 
processes. Thus, VAMPS is expected to be a potential drug 
target for treatment of autoimmune diseases. 

Mashima et al. [27] reported that irf2 deficiency caused 
low serum levels of elastase-1 (ELAl) mediated by 
SNARES. PI3 binds to ELAl [2S]. Thus, it is likely that PI3 
is involved in mediating the functions of SNARE. 

O'Hanlon et al. [29] also investigated gene expression 
profiles. RNA microarray analyses (Agilent Human 1A(V2) 
20K oligo arrays) were used to quantify gene expression in 
peripheral blood cells from 20 MZ twin pairs discordant for 
systemic autoimmune diseases. The cohort consisted of six 
affected probands with SLE, six with RA, eight with idio- 
pathic inflammatory myopathies, and their same-gender un- 
affected twins. O'Hanlon et al. [29] confirmed that several 
genes' expression could be distinguished between normal 
controls and patients. Genes that O'Hanlon et al. [29] identi- 
fied as important were TNFAIP6, TNFSFIO, MAP2K6, 
ILIRN, IFI27, ANXA3, CEA- CAM6, DEFA4, EIF2AK2, 
FCERAl, SYTL2, LGR6, KRTCAP2, LTK, and FYN. 
Among these genes, TNFAIP6 was observed to be co- 
expressed with AIM2 during the first 10 weeks of therapy 
with Pegylated-interferon-alfa2b (Peglntron™ ) and ribavirin 
(administered by weight) in HCV patients [30]. 
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HGF and TIEl are receptor tyrosine kinases. The over- 
expression of HGF results in the up- regulation of TIEl and 
PECAM in the 3T3-F442A mouse cell line [31]. PECAMl 
functions in the KEGG pathway for leukocyte transcndothe- 
hal migration (hsa04670). TIEl upregulates the cell adhesion 
molecules (CAMs) VCAM-1, E-selectin, and ICAM-1 
through a p38-dependent mechanism. VCAMl mediates 
adhesion between leukocytes and endothelium prior to leu- 
kocyte transendothelial migration. This is important as mi- 
grating leukocytes cause RA. However, HGF downregulates 
VEGF-mediated expression of ICAM-1 and VCAM-1 at the 
transcriptional level [32]. The association between HGF and 
CD82 has also been reported [33, 34]. Thus, it is likely that 
HGF, TIEl, PECAMl and CD82 work cooperatively and 
may be potential drug targets for the treatment of autoim- 
mune diseases. 

The Nucleotide Oligomerization Domain (NOD)-like 
receptor (NLR) signaling pathway (hsa04621) includes 
N0D2 (CARD 15) and TRIP6. NLRs are cytoplasmic pro- 
teins with a variety of functions for the regulation of inflam- 
matory and apoptotic responses. CARD 15 has been identi- 
fied as a risk factor gene for Crohn's disease [2]. 

SYK is a spleen tyrosine kinase [EC:2.7.10.2], which 
functions in many immune-related KEGG pathways such as 
natural killer cell mediated c5^otoxicity (ko04650), B cell 
receptor signaling (ko04662), Fc epsilon RI signaling 
(ko04664), and Fc gamma R-mediated phagocytosis 
(ko04666). 

Natural killer cell mediated cytotoxicity is the directed 
killing of a target cell by a natural killer cell through the re- 
lease of granules containing cytotoxic mediators or through 
the engagement of death receptors. 

Taken together, a remarkable number of proteins among 
those selected by the present study have strong relationships 
with proteins previously reported to be related to the auto- 
immune disease- related biological processes. 

Biological Significance of Inference by FAMS 

In this subsection, we determined whether the biological 
features attributed to the model proteins selected by FAMS 
(Table 2) were reasonable. Because of space hmitations, we 
carmot explain all of them singly, therefore we will discuss 
selected examples. 

TRIP6 is expected to have a similar structure to CRPl, 
which is involved in immune responses [35]. TM7SF3 is a 
cytochrome c oxidase, which was reported to bind to im- 
mune gamma-globulins [36]. TIEl, PECAMl and CSF3R 
form Down Syndrome Cell Adhesion Molecule (DSCAM), 
which is an immunoglobulin (Ig)-superfamily receptor in 
insects [37]. SYK is a tyrosine-protein kinase ZAP-70 and 
both SYK and ZAP-70 display distinct requirements for Src 
family kinases in immune response receptor signal transduc- 
tion [38]. STAT5A is found in PDB, which has a critical role 
in cytokine responses and normal immune functions [39]. 
SPIl is ETSl, and is expressed in SLE and functions in the 
immune system [40]. S100-A2 is in PDB and is reported to 
be an anti-body that inhibits the receptor for advanced glyca- 
tion end products (RAGE) ligands [41]. RARA is structur- 
ally similar to RXR-a, which is involved in inflammatory 



responses [42]. PI3 is a WAP, which is important in innate 
immunity [43]. PAD 14 is in PDB and is important in RA 
[44]. The structure of MPL structure was inferred to be simi- 
lar to IL6RB. IL6R is a key mediator of RA [45]. LCN2, 
also called NGAL, is in PDB. NGAL has been tested as a 
marker of inflammatory status for the early diagnosis of in- 
flammatory diseases such as DS [46]. HNF-6 is a H0XB2 
model proteins and causes immunologically distinct features 
[47]. AIM2 is structurally similar to IFI16. AIM2 and IFI16 
have critical roles in immunology [48]. CARD 15 was in- 
ferred to be similar to TLR4, which has a role in cell antivi- 
ral responses together with TLR3:TICAM1 -specific signal- 
ing pathways [49]. CD82 is an acetylcholine receptor protein 
that has functions in the immune system [50]'. CSFIR is 
assigned as TITAN, and is involved in immune responses 
[51]. SPPl is an acid phosphatase and is related to autoim- 
mune prostatitis [52]. LM02, (rhombotin-2), is related to 
ZFAT (a zinc-finger gene in autoimmune thyroid disease 
susceptibility region) an immune-related transcriptional 
regulator containing 18 C2H2-type zinc-finger domains and 
one AT-hook) [53]. DHCR24 is a cytokinin dehydrogenase. 
Cytokines have been referred to as immunomodulating 
agents. SEPT9 is homologous to septin-2, which is upregu- 
latcd in cytoskeletal and immune function-related proteome 
profiles [54]. IFNGR2 is a four-domain fibronectin frag- 
ment, which plays a role in autoimmune diseases [55]. CSF3 
is in PDB, and is related to immune system functions [56]. 
GRB7 (GRB 1 0) has roles in the immune system and in can- 
cer [57]. HGF is related to plasminogen, which plays critical 
roles in autoimmune diseases together with matrix metallo- 
proteinase (MMP) 9 [58]. LTB4R is a substance-P receptor, 
which mediates immune responses to respiratory syncytial 
virus infection [59]. 

This is only a partial list of the immune system-related 
features that were attributed to each model protein selected 
by FAMS. Although more examples could be listed, we have 
omitted them because of space restrictions. Such a coinci- 
dence that most of the selected model proteins have some 
relationship to immune-related molecular functions cannot 
be accidental. This suggests that the three-dimensional struc- 
tures inferred by FAMS have the potential for autoimmune 
disease-related drug discovery. 

Possibility of Drug Discovery 

To regard the selected proteins as drug targets, it is re- 
markable to find that all of the three-dimensional structures 
predicted by FAMS are immune-related. However, it is im- 
portant to use this information for drug discovery by FAMS. 
In this subsection, we would like to demonstrate the possi- 
bilities of drug discovery for the proteins selected in the pre- 
sent study. 

Ligand Binding to "Pocket" 

The most popular method of identifying drugs is to find a 
small molecule to bind a "pocket" of each protein. If FAMS 
can identify or suggest such a candidate for each gene in 
Table 2, it will be very useful. 



'Although the P-value attributed to CD82 was not small enough, the reliability of this 
assignment was reasonable after ftirther consideration (data not shown). 
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For example, two proteins, MMP8 and MMP14, are 
shown in Table 2. They coregulate target genes [60]. Both of 
them are members of the MMP family, which is related to 
inflammation. For MMP8, using IXUC A, which is MMP- 
13, as a template, FAMS successfully showed that there 
might be many ligands that bind to MMP8 (Fig. S14). Simi- 
larly, for MMP14, using IBQO B, which is MMP-3, as a 
template, FAMS successfully showed many ligands could 
also bind MMP14 (Fig. S15). Although this cannot be de- 
scribed as finding a new drug, it does show the potential for 
proteins listed in Table 2 that may be novel drug targets. 

To determine the possibility of "/« silico" ligand screen- 
ing, we employed MMP8. For MMP8, we identified three 
PDB structures for model proteins, lA86_pld, 1173_pld, and 
IMMB BAT. Table 4 shows the correlation coefficients 
between FPAscores and Ki values for 15 hgand candidates 
used in the assay CHEMBL711322. In this calculation, we 
calculated the mean of the values obtained by more than one 
model protein that were indicated by three-digit numbers. 
For example, 100 represents the consideration of only 
lA86_pld and the exclusion of 1173_pld and IMMB BAT. 
Figure S16 shows the scatter plots of FPAscores and Ki val- 
ues. Table SI shows the FPAscores and values for ligand 
candidates. Table 4 suggests that the mean over two pairs, 
i.e., lA86_pld and 1173_pld, or lA86_pld and IMMB BAT, 
can provide a significant correlation independent of the types 
of correlation coefficients considered. 

Based on the results obtained here, we can expect possi- 
ble "in silico" screenings for drug discovery. 

Inhibition of Protein Complex Formation 

Another new possibility for drug targets is the inhibition 
of protein complex formation. Many proteins cannot func- 
tion as a single substance and must form protein complexes 
with other proteins. Thus, if we inhibit the protein complex 
formation, we can also inhibit the fiinction of the protein 
complex. Table 5 lists protein complex candidates inferred 
by FAMS. Since FAMS uses a representative protein within 
each cluster that has more than 95 % sequence similarity as a 
model protein, there are often more than a thousand model 



proteins, which can bind to other proteins. However, it can 
be seen that the list includes many reasonable outcomes. For 
example, there are 52 model protein complexes listed when 
using both CSF3 and CSF3R as reference proteins. Just by 
the name, it is obvious that they are possibly a ligand and its 
receptor. However, there are 186 model proteins between 
CSF3R and CSFIR. This represents the possibility that each 
monomer can form protein complexes that can fiinction to- 
gether, possibly as a receptor. In addition to this, both 
CSF3R and CSFIR most frequently have non-zero model 
proteins that bind to other reference proteins. This is reason- 
able since many proteins can bind as a ligand or can form a 
receptor. Further analysis of this table will give us fiuitfiil 
information to find drug targets by inhibiting the formation 
of protein complexes. 

In addition to these known protein complexes, there are 
many new candidates for protein complex formation. Figure 
S17 shows one such possible candidate. In Table 6, we com- 
puted binding energy with FiberDock software [61]. Ob- 
tained binding energy was significantly small enough to 
form protein complex. 

In Table 5, there are 410 possible candidate pairs be- 
tween CSFIR and PECAMl . Among these, there is one pair 
having 61 atom pairs in contact with each other. This means 
there is a structure on PDB (2ZJS) that includes monomers 
whose protein structures are expected to be similar to CSFIR 
and PECAMl, respectively. 2ZJS is a SecYE translocon, 
which functions as a protein-conducting channel [62]. Al- 
though this protein complex was found in Thermus thermo- 
philus, since this protein is expected to be highly conserved, 
it is possible that CSFIR and PECAMl form a protein com- 
plex that is secreted across or integrated into membranes and 
plays a critical role in human autoimmune diseases. Thus, if 
we can identify a drug that inhibits the formation of a protein 
complex between CSFIR and PECAMl, it may be used to 
treat autoimmune diseases. 

Although many more protein complex formation candi- 
dates were detected, we carmot report all of them here be- 
cause of space limitations. This will be reported in the future. 



Table4. A comparison between the FPAscores and Ki values. Three-digit numbers indicated which of the three model PDB files, i.e., 
lA86_pld, lI73_pld,lMMB_BAT, were considered for choosing LD computation. AdjustedP-valueswereP-valuesadjustedby 
Benjamin! and Hochberg[63]. Bold numbers indicate adjusted P-values less than 0.05. 
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DHCR24_P652_R 
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HGF_E102_R 
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LTB4R_P163_F 

MMPU_P13_F 

MMP8_E89_R 

MPL_P62_F 

PADW_E24_F 

PECAM1_E32_R 

PI3_P27'1_R 

RARA_P1076_R 

S100A2_E36_R 

SEPT9_P37<l_F 

SLC22AI8_P2I6_ 

SP1I_E205_F 

SPP1_P647_F 

STAr5A_P704_R 

SYK_P584_F 

TIE1_E66_R 

TM7SF3_P1068_R 

TRIP6_P1090_F 

VAMP8_P241_F 



Why is PCA Useful to Infer Differential Expression 
between Control and Disease? 

To our knowledge, this is the first report to describe 
commonly methylated promoters between more than one 
autoimmune disease and healthy controls. In this subsection, 
we argue why it is difficult to find commonly methylated 
promoters between more than one autoimmune disease and 
healthy controls and why our PCA-based method was suc- 
cessful. 

First, to determine whether individual promoters are dif- 
ferently methylated between control and disease groups by 
the conventional statistical test, more than one replicate is 
required. However, as shown by Javierre et al. [6], when 
twins are used, it is impossible to obtain more than one repli- 
cate, because they are twins. Twins consist of two geneti- 
cally identical humans. Thus, if one has the disease and the 
other does not, there is no way to obtain additional geneti- 
cally identical individuals. In this case, an additional normal 
control must be used to obtain biological replicates. How- 
ever, since additional subjects will have a different genome 
from the twins, the addition of non-twin control samples 
weakens the ability of detection of significance. 

Another method to apply the statistical test to the set of 
probes is to divide the probes into two classes: differently 
expressed probes and others. However, to find suitable clas- 
sifications with which to divide probes to differently ex- 
pressed probes and others, multiple trials and errors are re- 
quired. These trials weaken the sensitivity of the test because 
of the correction due to the multiple comparisons. 



Although these are general problems, there are also some 
autoimmune-specific problems. For example, there is a 
strong gender-dependence of promoter methylation related to 
autoimmune diseases. When PCA is applied to the whole 
sample, the second PC is related to X chromosome specific 
methylation (Fig. SI). While males have only one X chro- 
mosome, females have two. Usually, this difference is sus- 
tained by methylation of one pair of the female X chromo- 
somes. Thus, one has to exclude genes located on the X 
chromosome. However, differential promoter methylation is 
also observed for DM between different genders (see sec- 
tion, "PCA apphcation to DM samples"). This can also 
weaken the sensitivity of statistical tests. 

Usually, there is no way to detect these dependencies in 
advance. However, PCA correctly detected these and se- 
lected differently expressed probes considering these gender 
specificities. Thus, PCA can separate different dependencies 
from each other. For example, for DM, the second PC repre- 
sented gender specific promoter methylation (Fig. Sll(a)) 
while the third PC reflected differential methylation between 
healthy twins and DM twins (Fig. Sll(b)). It even automati- 
cally reflected the reversed methylation dependent upon 
gender. For RA, the subgroup with ID: 85 did not have dif- 
ferential methylation and it was automatically reflected 
(Figs. S7(b)). 

PCA does not assume pre-defined classification but re- 
fiects classifications with larger inter- class differences. If 
there are some classes accompanied with differentially 
methylation, it is probable that PCA can detect them. This is 
not guaranteed, but it worked in the present research. This is 
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an explanation for why this study could detect differences by 
PCA-based promoter methylation selection. 

CONCLUSIONS 

In this study, we applied PCA-based gene selection to 
promoter methylation measurements to identify promoters of 
genes differently methylated between healthy twins and 
autoimmune diseased twins. Significant numbers of gene 
promoters were commonly demethylated between distinct 
autoimmune diseases, but differently between healthy twins 
and diseased twins. The genes whose promoters were com- 
monly methylated had a remarkable relationship to genes 
that were previously shown as related to autoimmune dis- 
eases. Their three-dimensional protein structures were also 
inferred by FAMS. Most model proteins selected to infer the 
three-dimensional structures were immune-related proteins. 
This fact reveals that the inference by FAMS was biologi- 
cally accurate. The possible apphcations of the inference by 
FAMS for drug discovery were discussed. The inference by 
FAMS may therefore be useful for both ligand discovery and 
the search for composites that inhibit protein complex forma- 
tion, which may promote autoimmune disease. 
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